CHALLENGE PRESENTATION
The main goal is to formulate an over/under betting strategy based on the football_example.csv file.
METHODOLOGY
1. Identify the type of problem: supervised or unsupervised learning? Regression or classification? Define an error metric to compare models.
The challenge is a regression problem and the Mean Absolute Error (MAE) will be the metric used to evaluate performance.
2. Data understanding - missing values, variable types (one-hot encoding: many models require all variables to be numeric). Handle missing values if necessary (delete rows or use mean/median imputation).
3. Descriptive statistics - response variable distribution; main statistics (min, max, mean, quantiles, standard deviation, and mode for categorical variables) for each feature.
4. Correlation analysis between the response variable and the other numeric variables (or categorical variables after one-hot encoding).
5. Data splitting: a training set (used to train algorithms and tune hyper-parameters) and a testing set (used to estimate the prediction error, i.e. the generalization error). Data from the test set won't be used during model training.
6. Cross validation on the training set in order to define the best strategy.
Comparison of different algorithms (OLS, Ridge, Lasso, ..., Random Forest, XGBoost) by tuning their hyper-parameters within the cross validation.
7. Identify the best strategy by analyzing the cross validation errors of each model.
8. Retrain the best model (i.e. the model with the best hyper-parameters from cross validation) on the full training set and assess performance on the testing set.
In the following code, we'll tune some hyper-parameters in the cross validation but without searching for the optimal combination (grid search).
9. Formulate an over/under betting strategy based on quantile prediction with Random Forest. Check the performance and robustness of the strategy.
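As a side note, a full grid search (not run in this analysis) would simply enumerate hyper-parameter combinations and evaluate each one in the cross validation. A minimal sketch, with illustrative parameter names and values that are not the ones tuned later:

```r
# Hedged sketch: enumerate hyper-parameter combinations for a grid search.
# ntree and mtry values below are made up for illustration.
grid <- expand.grid(ntree = c(100, 250, 500),
                    mtry  = c(3, 5, 7))
nrow(grid) # number of combinations to evaluate in CV
```

Each row of the grid would be trained and scored in the cross validation loop, and the row with the lowest CV error kept.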
LOAD PACKAGES
# Disable scientific notation
options(scipen=999)
# Load useful packages ----
# Data manipulation
library(tidyverse)
library(data.table)
library(broom)
# String manipulation
library(stringr)
# Resample data
library(rsample)
library(recipes)
# Correlation Analysis
library(correlationfunnel)
# Machine learning
library(caret)
library(mlr)
library(glmnet)
library(gbm)
library(xgboost)
library(e1071)
library(neuralnet)
library(kernlab)
library(rpart)
library(ranger)
library(randomForest)
library(splines)
# Graphics
library(ggplot2)
library(plotly)
# Data Cleaning
library(janitor)
# EDA
library(skimr)
# Apply mapping functions
library(purrr)
library(furrr) # in parallel
# Rmarkdown
library(epuRate)
library(rmarkdown)
library(knitr)
library(formattable)
library(kableExtra)
# TIming R scripts
library(tictoc)
# Exploring data
# summarize_df<- mlr::summarizeColumns(df)
# Some algorithms don't require you to impute missing values:
# you can simply supply them the missing data and they take care of it on their own
# Let's see which algorithms they are
# listLearners("regr", check.packages = TRUE, properties = "missings")[c("class","package")]
# To get the list of parameters for any algorithm
# getParamSet("regr.cvglmnet")
# Import data from Github
link_github <- "https://raw.githubusercontent.com/benj7/betting_strategy/master/football_example.csv"
df <- fread(link_github)
# 38,810 rows and 22 variables
# initial data backup
df_backup <- df

# Descriptive stats
skim(df)

| Name | df |
| Number of rows | 38810 |
| Number of columns | 22 |
| Key | NULL |
| _______________________ | |
| Column type frequency: | |
| character | 6 |
| numeric | 15 |
| POSIXct | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| market_id | 0 | 1 | 11 | 11 | 0 | 1719 | 0 |
| league | 0 | 1 | 13 | 14 | 0 | 2 | 0 |
| home | 0 | 1 | 5 | 24 | 0 | 47 | 0 |
| away | 0 | 1 | 5 | 24 | 0 | 47 | 0 |
| bet_type | 0 | 1 | 5 | 5 | 0 | 2 | 0 |
| selection | 0 | 1 | 4 | 4 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| league_id | 0 | 1 | 34.56 | 0.50 | 34.00 | 34.00 | 35.00 | 35.00 | 35.00 | ▆▁▁▁▇ |
| elo_home_ft | 0 | 1 | 1195.63 | 136.51 | 892.67 | 1105.89 | 1174.10 | 1256.89 | 1720.75 | ▂▇▃▁▁ |
| elo_home_fh | 0 | 1 | 1106.59 | 88.32 | 874.44 | 1053.60 | 1095.65 | 1144.09 | 1460.90 | ▁▇▅▁▁ |
| elo_away_ft | 0 | 1 | 1114.49 | 136.44 | 839.25 | 1016.95 | 1105.15 | 1181.36 | 1623.97 | ▃▇▃▁▁ |
| elo_away_fh | 0 | 1 | 1061.47 | 86.53 | 851.85 | 1001.90 | 1052.93 | 1113.96 | 1397.26 | ▂▇▅▂▁ |
| home_score | 0 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ▁▁▇▁▁ |
| away_score | 0 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ▁▁▇▁▁ |
| observations | 0 | 1 | 1480.61 | 896.78 | 467.00 | 467.00 | 2274.00 | 2274.00 | 2274.00 | ▆▁▁▁▇ |
| minute | 0 | 1 | 15.00 | 0.00 | 15.00 | 15.00 | 15.00 | 15.00 | 15.00 | ▁▁▇▁▁ |
| final_home_score | 0 | 1 | 1.19 | 1.20 | 0.00 | 0.00 | 1.00 | 2.00 | 9.00 | ▇▃▁▁▁ |
| final_away_score | 0 | 1 | 0.90 | 1.03 | 0.00 | 0.00 | 1.00 | 1.00 | 6.00 | ▇▂▁▁▁ |
| handicap | 0 | 1 | 2.06 | 1.05 | 0.50 | 1.25 | 2.00 | 2.75 | 7.00 | ▇▆▂▁▁ |
| league_prob_at_least | 0 | 1 | 0.44 | 0.22 | 0.00 | 0.28 | 0.40 | 0.57 | 0.90 | ▅▂▇▇▂ |
| odds | 0 | 1 | 1.24 | 1.02 | 0.06 | 0.53 | 0.95 | 1.60 | 7.38 | ▇▂▁▁▁ |
| profit | 0 | 1 | -0.04 | 0.99 | -1.00 | -1.00 | 0.00 | 0.58 | 7.07 | ▇▂▁▁▁ |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| match_start_time_gmt | 0 | 1 | 2017-01-06 19:45:00 | 2020-03-10 19:00:00 | 2018-09-02 15:00:00 | 1285 |
It's very interesting to notice that the average profit is negative: -0.04. If we decided to place all bets, the overall profit would be negative, equal to -1721.
The variables "home_score" and "away_score", which are the scores at the 15th minute for the home and away teams, are always equal to 0. These variables are not reliable enough to be used for modelling.
plot_ly(data = df,
x = ~profit,
type = "histogram",
histnorm = "probability") %>%
layout(title = "Frequency distribution of Profit",
xaxis = list(title = "Profit",
zeroline = FALSE),
yaxis = list(title = "Frequency",
               zeroline = FALSE))

Around 54% of the bets have a profit negative or equal to 0.
The response variable is skewed, but we can't apply a log transformation since the target can be negative.
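If a transformation were still desired, a signed log is one hedged option for targets that can be negative (not used in this analysis):

```r
# Sketch of a symmetric, sign-preserving log-like transform
# that tolerates zero and negative values, unlike log()
signed_log <- function(x) sign(x) * log1p(abs(x))
signed_log(c(-1, 0, 7.07)) # compresses large profits, keeps the sign
```

The trade-off is that coefficients and errors become harder to interpret on the original profit scale, which is why the analysis below keeps the raw target.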
Correlation Analysis on data to identify key features that relate to the target (profit)
# Feature Engineering ----
df <- df %>% mutate(
# Rename modalities of the variable "selection"
selection = case_when(selection == "home" ~ "over",
selection == "away" ~ "under"),
# Selection x Handicap
selection_handicap = str_c(selection,
handicap,
sep = "_"),
# Compute difference between elo_home and elo_away
elo_diff_ft = elo_home_ft - elo_away_ft,
elo_diff_fh = elo_home_fh - elo_away_fh,
  # Compute difference between league_prob_at_least and the implied probability from odds
diff_probs = league_prob_at_least - 1/(odds+1),
# Total number of goals
nb_goals = final_home_score + final_away_score,
# Is profit positive? 1: Yes - 0:No
positive_profit = if_else(profit > 0, 1, 0),
# Ratio between profit and odds
profit_odd_ratio = profit/odds)
df <- as.data.table(df)
df <- df[profit < 0, profit_odd_ratio := profit]
# Round numeric variables to 3 decimal
col_num <- which(sapply(df,class) == "numeric")
sdcols <- names(df)[col_num]
df_num <- df[, lapply(.SD, function(x) round(x,3)), .SDcols = sdcols]
df_no_num <- df[, .SD, .SDcols = setdiff(names(df), sdcols)]
df <- cbind(df_no_num, df_num)
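The diff_probs feature above assumes the odds are net odds (profit per unit staked), so that the implied win probability is 1/(odds + 1). A quick standalone check of that arithmetic, under this assumption:

```r
# Implied probability from net odds: even money (net odds = 1) implies 50%
implied_prob <- function(net_odds) 1 / (net_odds + 1)
implied_prob(c(1, 0.5, 3))
```

If the odds column were decimal odds instead, the implied probability would be 1/odds; the min of 0.06 observed in skim(df) is more consistent with net odds.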
# Correlation Analysis - Identify key features that relate to the target ("profit")
# Transform the data into a binary format
df_binarized <- df %>%
# remove useless variables with no information or variables that we don't know before betting
select(-c(market_id,
match_start_time_gmt,
league_id,
home,
away,
home_score,
away_score,
observations,
minute,
final_home_score,
final_away_score,
nb_goals,
selection_handicap, # interesting but makes the chart unreadable
profit,
profit_odd_ratio)) %>%
binarize(n_bins = 5 , thresh_infreq = 0.01, name_infreq = "OTHER", one_hot = TRUE)
# Perform correlation analysis
df_binarized_corrl <- df_binarized %>%
correlate(positive_profit__1)
# Visualize the feature-target relationships
df_binarized_corrl %>%
plot_correlation_funnel()
Low odds are, logically, more correlated with a positive profit. According to this analysis, the "diff_probs" feature seems to be correlated with profit - that's interesting for modelling purposes!
Handicaps between 1 and 1.75 and Under bets are slightly more positively correlated with a positive profit.
It would make more sense to analyze the variable "selection x handicap", but that unfortunately makes the chart unreadable.
# Remove useless variables with no information or variables that we don't know before betting
df <- as.data.table(df)
features_to_keep <- setdiff(names(df),c("market_id",
"match_start_time_gmt",
"league_id",
"home",
"away",
"home_score",
"away_score",
"observations",
"minute",
"final_home_score",
"final_away_score",
"nb_goals",
"selection_handicap",
"positive_profit",
"profit_odd_ratio"))
df_model <- df[, .SD, .SDcols = features_to_keep]
# one-hot encode our categorical variables
one_hot <- dummyVars(~ ., df_model, fullRank = FALSE)
df_model_hot <- predict(one_hot, df_model) %>% as.data.frame()

# Response variable renaming
df_model_hot <- df_model_hot %>%
rename(Y = profit)
# Separate into random training and test data sets ----
# set.seed() for reproducibility
set.seed(42)
# We keep 80% of the data for model training
split_strat <- initial_split(df_model_hot, prop = 0.8)
df_train <- training(split_strat)
df_test <- testing(split_strat)

We implement a 5-fold cross validation manually (with a for loop) on the training set. This gives us a lot of flexibility and a good understanding of what goes on behind the scenes.
For large-scale machine learning projects, I use H2O which automates the cross validation and hyperparameter tuning process. It’s much faster and more scalable since H2O is written in Java.
# Training set
X <- as.data.table(df_train %>% select(-Y))
Y <- df_train$Y
XY <- cbind(Y,X)
names(XY)[1] <- "Y"
# Testing set
X_test <- as.data.table(df_test %>% select(-Y))
Y_test <- df_test$Y
XY_test <- cbind(Y_test, X_test)
names(XY_test)[1] <- "Y"
# Cross validation
# Low cross validation errors indicate that our model does not suffer from high variance and generalizes well to unseen data
# Number of folds
N <- 5
# Folds building
set.seed(42)
Folds <- sample(1:dim(XY)[1] %% N + 1)
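The modulo trick above assigns each row a fold label 1..N and then shuffles the labels, yielding equal fold sizes when the row count divides evenly. A standalone sketch with the same row count as the full dataset:

```r
# Sketch: build 5 shuffled folds over 38810 rows (38810 / 5 = 7762)
set.seed(42)
n <- 38810; N <- 5
Folds_demo <- sample(1:n %% N + 1)
table(Folds_demo) # five folds of 7762 rows each
```

Because sample() only permutes the labels, each fold keeps exactly n/N rows, so every observation is used once for validation and N-1 times for training.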
Output <- list()
Time <- list()
i <- 1
# for loop to perform cross validation
for(i in 1:N){
Predictions <- list()
# Estimation samples
XE <- X[Folds != i]
YE <- Y[Folds != i]
# Data needs to be in a matrix format for some models
XME <- as.matrix(XE)
YME <- as.matrix(YE)
XYE <- XY[Folds != i]
# Prediction samples
XP <- X[Folds == i]
YP <- Y[Folds == i]
# Data needs to be in a matrix format for some models
XMP <- as.matrix(XP)
YMP <- as.matrix(YP)
XYP <- XY[Folds == i]
print(stringr::str_c("Fold number ", i))
print("**************************************************")
# Add the Y variable to the prediction dataset (for comparison purposes)
Predictions[["Y"]] <- YP
Predictions[["Y"]] <- as.numeric(Predictions[["Y"]])
#********************* ORDINARY LEAST SQUARES **********************#
print(Nom <- "OLS (LM)")
Start <- proc.time()
Model <- stats::lm(formula = Y ~ ., data = XYE)
Response <- predict(Model, newdata = XP, type = "response")
Predictions[[Nom]] <- as.numeric(Response)
Time[[Nom]] <- ifelse(is.null(Time[[Nom]]), 0, Time[[Nom]]) + (proc.time() - Start)[3]
#********************* REGULARIZED REGRESSIONS *********************#
print(Nom <- "OLS (GLMNET)")
Start <- proc.time()
Model <- glmnet::glmnet(x = XME, y = YME, lambda = 0, alpha = 0, family = "gaussian")
Response <- predict(Model, newx = XMP, type = "response")
Predictions[[Nom]] <- as.numeric(Response)
Time[[Nom]] <- ifelse(is.null(Time[[Nom]]), 0, Time[[Nom]]) + (proc.time() - Start)[3]
print(Nom <- "Ridge (GLMNET)")
Start <- proc.time()
Model <- glmnet::cv.glmnet(x = XME, y = YME, alpha = 0, family = "gaussian")
Response <- predict(Model, s = "lambda.min", newx = XMP, type = "response")
Predictions[[Nom]] <- as.numeric(Response)
Time[[Nom]] <- ifelse(is.null(Time[[Nom]]), 0, Time[[Nom]]) + (proc.time() - Start)[3]
# Regularized regression reduces the model variance
# alpha = 1 for lasso (pushes some coefficients exactly to 0)
# alpha = 0 for ridge (no coefficients at 0)
# λ is a tuning parameter that helps keep our model from over-fitting the training data
# To identify the optimal λ value we need to perform cross-validation (CV):
# cv.glmnet provides a built-in option to perform k-fold CV (10-fold by default)
# plot(Model) shows the 10-fold CV mean squared error (MSE) across the λ values
# (the numbers at the top = number of non-zero variables)
# The first vertical dashed line marks the λ value with the minimum MSE,
# the second the largest λ value within one standard error of the minimum MSE
# min(Model$cvm) # minimum MSE
# Model$lambda.min # lambda for this min MSE
# Model$cvm[Model$lambda == Model$lambda.1se] # 1 st.error of min MSE
# Model$lambda.1se # lambda for this MSE
# Influential variables
# coef(Model, s = "lambda.1se") %>%
# tidy() %>%
# filter(row != "(Intercept)") %>%
# ggplot(aes(value, reorder(row, value), color = value > 0)) +
# geom_point(show.legend = FALSE) +
# ggtitle("Influential variables") +
# xlab("Coefficient") +
# ylab(NULL)
print(Nom <- "Lasso (GLMNET)")
Start <- proc.time()
Model <- glmnet::cv.glmnet(x = XME, y = YME, alpha = 1, family = "gaussian")
Response <- predict(Model, s = "lambda.1se", newx = XMP, type = "response")
Predictions[[Nom]] <- as.numeric(Response)
Time[[Nom]] <- ifelse(is.null(Time[[Nom]]), 0, Time[[Nom]]) + (proc.time() - Start)[3]
# coef(Model, s=Model$lambda.1se)
Alphas <- c(0.25, 0.5, 0.75)
for(Alpha in Alphas){
print(Nom <- stringr::str_c("Elastic net (GLMNET|alpha=", Alpha, ")"))
Start <- proc.time()
Model <- glmnet::cv.glmnet(x = XME, y = YME, alpha = Alpha, family = "gaussian")
Response <- predict(Model, s = "lambda.min", newx = XMP, type = "response")
Predictions[[Nom]] <- as.numeric(Response)
Time[[Nom]] <- ifelse(is.null(Time[[Nom]]), 0, Time[[Nom]]) + (proc.time() - Start)[3]
}
#********************* TREE *******************************#
print(Nom <- "Tree (RPART)")
Start <- proc.time()
Model <- rpart::rpart(formula = Y ~ ., data = XYE)
Response <- predict(Model, newdata = XP, type = "vector")
Predictions[[Nom]] <- as.numeric(Response)
Time[[Nom]] <- ifelse(is.null(Time[[Nom]]), 0, Time[[Nom]]) + (proc.time() - Start)[3]
# rpart.plot(Model)
#********************* RANDOM FORESTS *********************#
# tune the number of trees hyper-parameter in the cross validation
Numbers_Of_Trees <- c( 100, 250, 500)
# make ranger compatible names
names(XYE) <- make.names(names(XYE), allow = FALSE)
names(XP) <- make.names(names(XP), allow = FALSE)
for (ntree in Numbers_Of_Trees) {
print(Nom <- stringr::str_c("Random forest (RANDOMFOREST|ntree=", ntree, ")"))
Start <- proc.time()
features <- setdiff(names(XYE), "Y")
Model <- ranger(formula = Y ~ .,
data = XYE,
num.trees = ntree, # number of trees
                mtry = floor(length(features) / 3), # the number of variables
                # to randomly sample as candidates at each split
                # usual value for regression = p/3 with p the total number of variables
respect.unordered.factors = 'order',
verbose = FALSE,
seed = 42)
Response <- predict(Model, data = XP, type = "response")
Response <- Response$predictions
Predictions[[Nom]] <- as.numeric(Response)
Time[[Nom]] <- ifelse(is.null(Time[[Nom]]), 0, Time[[Nom]]) + (proc.time() - Start)[3]
}
#********************* BOOSTING *************************#
# XGBoost - approximately 10x faster than gbm
# For boosting tree models, the most important parameter is the number of trees to build - nrounds parameter
# XGBoost only works with matrices that contain all numeric variables
# tune the number of trees hyper-parameter in the cross validation
Numbers_Of_Trees <- c(100, 250, 500)
for (nrounds in Numbers_Of_Trees){
print(Nom <- stringr::str_c("XGBoost|n.trees=", nrounds))
Start <- proc.time()
Model <- xgboost(data = XME,
                 label = YME,
                 nrounds = nrounds,
                 eta = 0.1, # learning rate, 0.3 by default
                 max_depth = 4, # maximum depth of a tree, 6 by default
                 objective = "reg:linear", # deprecated alias of "reg:squarederror"
                 print_every_n = 1000) # hide intermediate results
Response <- predict(Model , XMP, missing = NA)
Predictions[[Nom]] <- as.numeric(Response)
Time[[Nom]] <- ifelse(is.null(Time[[Nom]]), 0, Time[[Nom]]) + (proc.time() - Start)[3]
}
# #********************* SVM *****************************#
# NOT RUN
# tune Kernel and Cost hyper-parameters in the cross validation
# Kernels <- c("linear", "polynomial", "radial")
# Costs <- c(0.1, 1, 2)
#
# for(Kernel in Kernels)
# {
# for(Cost in Costs)
# {
# print(Nom <- stringr::str_c("SVM (e1071|kernel=", Kernel, "|cost=", Cost, ")"))
# Start <- proc.time()
# Model <- e1071::svm(Y ~ ., data = XYE, kernel = Kernel, cost = Cost)
# Response <- predict(Model, newdata = XP, type = "response")
# Predictions[[Nom]] <- as.numeric(Response)
# Time[[Nom]] <- ifelse(is.null(Time[[Nom]]), 0, Time[[Nom]]) + (proc.time() - Start)[3]
# }
# }
Output[[1 + length(Output)]] <- as.data.table(do.call(cbind, Predictions))
print("**************************************************")
}

## [1] "Fold number 1"
## [1] "**************************************************"
## [1] "OLS (LM)"
## [1] "OLS (GLMNET)"
## [1] "Ridge (GLMNET)"
## [1] "Lasso (GLMNET)"
## [1] "Elastic net (GLMNET|alpha=0.25)"
## [1] "Elastic net (GLMNET|alpha=0.5)"
## [1] "Elastic net (GLMNET|alpha=0.75)"
## [1] "Tree (RPART)"
## [1] "Random forest (RANDOMFOREST|ntree=100)"
## [1] "Random forest (RANDOMFOREST|ntree=250)"
## [1] "Random forest (RANDOMFOREST|ntree=500)"
## [1] "XGBoost|n.trees=100"
## [11:21:48] WARNING: amalgamation/../src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
## [1] train-rmse:1.103074
## [100] train-rmse:0.933889
## [1] "XGBoost|n.trees=250"
## [11:21:50] WARNING: amalgamation/../src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
## [1] train-rmse:1.103074
## [250] train-rmse:0.884063
## [1] "XGBoost|n.trees=500"
## [11:21:54] WARNING: amalgamation/../src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
## [1] train-rmse:1.103074
## [500] train-rmse:0.821056
## [1] "**************************************************"
## [1] "Fold number 2"
## [1] "**************************************************"
## [1] "OLS (LM)"
## [1] "OLS (GLMNET)"
## [1] "Ridge (GLMNET)"
## [1] "Lasso (GLMNET)"
## [1] "Elastic net (GLMNET|alpha=0.25)"
## [1] "Elastic net (GLMNET|alpha=0.5)"
## [1] "Elastic net (GLMNET|alpha=0.75)"
## [1] "Tree (RPART)"
## [1] "Random forest (RANDOMFOREST|ntree=100)"
## [1] "Random forest (RANDOMFOREST|ntree=250)"
## [1] "Random forest (RANDOMFOREST|ntree=500)"
## [1] "XGBoost|n.trees=100"
## [11:22:14] WARNING: amalgamation/../src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
## [1] train-rmse:1.102835
## [100] train-rmse:0.938430
## [1] "XGBoost|n.trees=250"
## [11:22:15] WARNING: amalgamation/../src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
## [1] train-rmse:1.102835
## [250] train-rmse:0.888034
## [1] "XGBoost|n.trees=500"
## [11:22:19] WARNING: amalgamation/../src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
## [1] train-rmse:1.102835
## [500] train-rmse:0.821246
## [1] "**************************************************"
## [1] "Fold number 3"
## [1] "**************************************************"
## [1] "OLS (LM)"
## [1] "OLS (GLMNET)"
## [1] "Ridge (GLMNET)"
## [1] "Lasso (GLMNET)"
## [1] "Elastic net (GLMNET|alpha=0.25)"
## [1] "Elastic net (GLMNET|alpha=0.5)"
## [1] "Elastic net (GLMNET|alpha=0.75)"
## [1] "Tree (RPART)"
## [1] "Random forest (RANDOMFOREST|ntree=100)"
## [1] "Random forest (RANDOMFOREST|ntree=250)"
## [1] "Random forest (RANDOMFOREST|ntree=500)"
## [1] "XGBoost|n.trees=100"
## [11:22:40] WARNING: amalgamation/../src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
## [1] train-rmse:1.101184
## [100] train-rmse:0.939608
## [1] "XGBoost|n.trees=250"
## [11:22:41] WARNING: amalgamation/../src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
## [1] train-rmse:1.101184
## [250] train-rmse:0.885817
## [1] "XGBoost|n.trees=500"
## [11:22:45] WARNING: amalgamation/../src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
## [1] train-rmse:1.101184
## [500] train-rmse:0.822369
## [1] "**************************************************"
## [1] "Fold number 4"
## [1] "**************************************************"
## [1] "OLS (LM)"
## [1] "OLS (GLMNET)"
## [1] "Ridge (GLMNET)"
## [1] "Lasso (GLMNET)"
## [1] "Elastic net (GLMNET|alpha=0.25)"
## [1] "Elastic net (GLMNET|alpha=0.5)"
## [1] "Elastic net (GLMNET|alpha=0.75)"
## [1] "Tree (RPART)"
## [1] "Random forest (RANDOMFOREST|ntree=100)"
## [1] "Random forest (RANDOMFOREST|ntree=250)"
## [1] "Random forest (RANDOMFOREST|ntree=500)"
## [1] "XGBoost|n.trees=100"
## [11:23:05] WARNING: amalgamation/../src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
## [1] train-rmse:1.099292
## [100] train-rmse:0.935427
## [1] "XGBoost|n.trees=250"
## [11:23:06] WARNING: amalgamation/../src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
## [1] train-rmse:1.099292
## [250] train-rmse:0.882825
## [1] "XGBoost|n.trees=500"
## [11:23:10] WARNING: amalgamation/../src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
## [1] train-rmse:1.099292
## [500] train-rmse:0.817680
## [1] "**************************************************"
## [1] "Fold number 5"
## [1] "**************************************************"
## [1] "OLS (LM)"
## [1] "OLS (GLMNET)"
## [1] "Ridge (GLMNET)"
## [1] "Lasso (GLMNET)"
## [1] "Elastic net (GLMNET|alpha=0.25)"
## [1] "Elastic net (GLMNET|alpha=0.5)"
## [1] "Elastic net (GLMNET|alpha=0.75)"
## [1] "Tree (RPART)"
## [1] "Random forest (RANDOMFOREST|ntree=100)"
## [1] "Random forest (RANDOMFOREST|ntree=250)"
## [1] "Random forest (RANDOMFOREST|ntree=500)"
## [1] "XGBoost|n.trees=100"
## [11:23:30] WARNING: amalgamation/../src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
## [1] train-rmse:1.102181
## [100] train-rmse:0.935351
## [1] "XGBoost|n.trees=250"
## [11:23:32] WARNING: amalgamation/../src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
## [1] train-rmse:1.102181
## [250] train-rmse:0.882334
## [1] "XGBoost|n.trees=500"
## [11:23:36] WARNING: amalgamation/../src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
## [1] train-rmse:1.102181
## [500] train-rmse:0.816980
## [1] "**************************************************"
# Stack all the outputs tables up
Output <- data.table::rbindlist(l = Output, use.names = TRUE, fill = TRUE)
# Measure execution time
Time <- data.table(Model = names(Time), Time = unlist(Time))
# Function to define the error metric : Mean Absolute Error
loss_function <- function(X,Y){
mean(abs(X-Y))
}
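As a quick sanity check of the metric (toy numbers, not the real predictions):

```r
# MAE: mean of the absolute prediction errors
loss_function <- function(X, Y){
  mean(abs(X - Y))
}
loss_function(c(1, 2, 3), c(1, 4, 3)) # (0 + 2 + 0) / 3
```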
# Assess cross validation errors for each model
Errors <- apply(X = Output, MARGIN = 2, FUN = loss_function, Y = Output$Y)
Errors <- data.table(Model = names(Errors), error_rate = Errors)
Errors <- merge(x = Errors, y = Time, by = "Model", all.x = TRUE, all.y = FALSE)
Errors <- Errors[order(error_rate)]
Errors

After analyzing the cross validation errors, the best model is the Random Forest with 500 trees.
Let’s retrain on the full training set and predict on the testing set.
# Training set
X <- as.data.table(df_train %>% select(-Y))
Y <- df_train$Y
XY <- cbind(Y,X)
names(XY)[1] <- "Y"
# Testing set
X_test <- as.data.table(df_test %>% select(-Y))
Y_test <- df_test$Y
XY_test <- cbind(Y_test, X_test)
names(XY_test)[1] <- "Y"
# make ranger compatible names
names(XY) <- make.names(names(XY), allow = FALSE)
names(X_test) <- make.names(names(X_test), allow = FALSE)
features <- setdiff(names(XY), "Y")
# Best model: Random Forest with 500 trees, according to the cross validation errors
Model_final<- ranger(formula = Y ~ .,
data = XY,
num.trees = 500,
mtry = floor(length(features) / 3),
respect.unordered.factors = 'order',
                     # To plot variable importance
importance = "impurity",
# Quantile regressions forest to assess prediction intervals
quantreg = TRUE,
verbose = FALSE,
                     seed = 42)

# Predict on the test set
predictions_testRF <- predict(Model_final,
data = X_test,
type = "response")
predictions_test <- data.frame(cbind(Y_test, predictions_testRF$predictions))
# Performance on the test set
loss_function(X = predictions_testRF$predictions, Y = Y_test)

## [1] 0.5855065
The Mean Absolute Error equals 0.59 on the test set.
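To put that number in context, it can be compared with a naive baseline that predicts the mean training profit for every bet. A self-contained sketch with made-up profits (the real comparison would use Y and Y_test from above):

```r
loss_function <- function(X, Y) mean(abs(X - Y))
# Toy data standing in for training/testing profits (not the real data)
Y_train_toy <- c(-1, -1, 0.5, 2, -1)
Y_test_toy  <- c(-1, 0.6, -1)
baseline <- rep(mean(Y_train_toy), length(Y_test_toy))
loss_function(baseline, Y_test_toy) # the model should beat this number
```

A model whose test MAE beats the constant-mean baseline is actually exploiting the features rather than just the target's distribution.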
Let’s determine the main discriminant variables by plotting the variable importance.
# Plot Top important variables
var_imp_ranger<-as.data.frame(Model_final$variable.importance)
variables <-(as.vector((row.names(var_imp_ranger))))
var_imp_ranger <- cbind(variables,var_imp_ranger)
var_imp_ranger <- var_imp_ranger %>%
arrange(desc(`Model_final$variable.importance`))
p <- ggplot(var_imp_ranger,
aes(x = reorder(variables,`Model_final$variable.importance`) ,
y = `Model_final$variable.importance`,
fill = `Model_final$variable.importance`))+
geom_bar(stat="identity",
position="dodge",
aes(text = `Model_final$variable.importance`))+
coord_flip()+
ylab("Importance")+
xlab("")+
ggtitle("Variable Importance - Random Forest 500 Trees")+
  guides(fill = FALSE)+
scale_fill_gradient()+
theme_classic()
ggplotly(p, tooltip = "text")

The most important variables are the odds, the elo variables and diff.probs. Our feature engineering has therefore improved the model performance!
More important than the average values given by the Random Forest, we will implement Quantile Prediction (Quantile Regression Forest) to get an idea about the prediction intervals. We will use these predicted quantiles to determine a betting strategy thereafter.
BETTING STRATEGY

Quantile Prediction
# Quantile Prediction
quantiles_reg_forests <- predict(Model_final,
data = X_test,
type = "quantiles",
quantiles = seq(0, 1, by = 0.1))
pred_intervals <- as.data.table(quantiles_reg_forests$predictions)
# Concatenate actual value (profit) and predicted quantiles
actual_profit_w_pred_intervals <- data.table(cbind(Y_test, pred_intervals))
ncols <- ncol(actual_profit_w_pred_intervals)
names(actual_profit_w_pred_intervals)[2:ncols] <- c("Q0", "Q10","Q20", "Q30", "Q40", "Q50",
"Q60", "Q70","Q80", "Q90", "Q100")
# Display the five first rows
actual_profit_w_pred_intervals %>% head(5)

The Y_test value is the actual profit of the bet (test set). The other variables are the predicted quantiles (deciles in our case) from the Random Forest model.
Our betting strategy is based on quantile prediction. Given a quantile (e.g. the 10% quantile), the strategy places only the bets whose predicted quantile value is greater than 0. For the 10% quantile, this implies a probability of about 90% that the profit is positive.
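In base-R terms, the decision rule boils down to a filter on the predicted quantile column. A toy illustration with made-up quantiles (not model output):

```r
# If the predicted 10% quantile of profit is positive, place the bet
Y_test_toy <- c(0.5, -1, 0.8)    # actual profits
Q10_toy    <- c(0.1, -0.4, 0.05) # predicted 10% quantiles (made up)
placed <- Q10_toy > 0
c(nb_bets_placed = sum(placed), sum_profit = sum(Y_test_toy[placed]))
```

The function below implements the same idea with data.table, additionally splitting the placed bets into winning and losing ones.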
# Function to get the number of bets placed, the proportion of winning/losing bets
# and the overall profit for each quantile-based decision rule
summarise_bets_placed <- function(quantile){
dt <- copy(actual_profit_w_pred_intervals)
# Distinguish positive profits from negative profits
dt <- dt[, positive_profit := ifelse(Y_test > 0, 1, 0)]
dt <- dt[get(quantile) > 0,
.(nb_bets_placed = .N,
sum_profit = sum(Y_test)),
by = .(positive_profit)]
dt$decision_rule_quantile <- quantile
dt <- dt[,
c("prop_bets",
"sum_nb_bets_placed",
"sum_profit_quantile") := list(nb_bets_placed/sum(nb_bets_placed),
sum(nb_bets_placed),
sum(sum_profit)),
by= .(decision_rule_quantile)]
sdcols <- c("decision_rule_quantile", "positive_profit", "nb_bets_placed",
"prop_bets", "sum_nb_bets_placed", "sum_profit", "sum_profit_quantile")
dt <- dt[, .SD, .SDcols = sdcols]
return(dt)
}
quantiles <- list("Q10", "Q20", "Q30", "Q40", "Q50",
"Q60", "Q70", "Q80", "Q90", "Q100")
bets_placed_quantile <- map(quantiles,
summarise_bets_placed)
bets_placed_quantile <- rbindlist(bets_placed_quantile)
bets_placed_quantile

This table gives us the number of bets to be placed, the proportion of winning/losing bets and the overall profit for each quantile-based decision rule.
Interpretation
Example 1: Quantile 10%-based Decision Rule. Our strategy is to place a bet only if Quantile 10% > 0.
Only 161 bets will be placed. They are all winning bets, with an overall profit of 47.
Example 2: Quantile 60%-based Decision Rule. Our strategy is to place a bet only if Quantile 60% > 0.
4091 bets will be placed: 3243 winning and 848 losing. But the overall profit (1835.2) is higher than in Example 1.
Example 3: Quantile 100%-based Decision Rule. Our strategy is to place a bet if Quantile 100% > 0.
Thus, all the bets will be placed. There is no selection! 7762 bets placed, with 3591 winning and 4171 losing. The overall profit is negative: -431.5.
Depending on their risk aversion, the gambler can choose to:
1. Mitigate risk by placing very few bets and securing the profits (decision rule: Quantile 10% > 0)
2. Try to optimize the overall profit (decision rule: Quantile 50% > 0 or Quantile 60% > 0). In that case there will be losing bets, but the overall profit is higher.
Let's apply our strategy to another random sample from the test set to check its robustness.
Strategy Robustness
# Random sample from test set
set.seed(42) # for reproducibility
size <- 2500
ind <- sample(1:nrow(X_test), size)
X_test_samp <- X_test[ind]
Y_test_samp <- Y_test[ind]
sum(Y_test_samp)

## [1] -122.736
# Predict on the test set
quantiles_reg_forests <- predict(Model_final,
data = X_test_samp,
type = "quantiles",
quantiles = seq(0, 1, by = 0.1))
pred_intervals <- as.data.table(quantiles_reg_forests$predictions)
actual_profit_w_pred_intervals <- data.table(cbind(Y_test_samp, pred_intervals))
names(actual_profit_w_pred_intervals) <- c("Y_test","Q0", "Q10","Q20", "Q30", "Q40",
"Q50","Q60", "Q70","Q80", "Q90", "Q100")
quantiles <- list("Q10", "Q20", "Q30", "Q40", "Q50",
"Q60", "Q70", "Q80", "Q90", "Q100")
bets_placed_quantile_samp <- map(quantiles,
summarise_bets_placed)
bets_placed_quantile_samp <- rbindlist(bets_placed_quantile_samp)
bets_placed_quantile_samp

Quantile 10%-based Decision Rule: 56 bets placed (out of 2500 possible bets), with an overall profit of 17.
Here again, the optimal strategy seems to be the Quantile 60%-based Decision Rule: 1305 bets placed (1027 winning and 278 losing), with an overall profit of 581.7.
Out of intellectual curiosity, let's consider the worst-case scenario: what if we only keep bets with a negative profit?
What would our quantile-based strategy yield in such a case?
Worst-Case Scenario
# Worst-Case Scenario - we only keep bets with negative profit
ind <- which(Y_test < 0)
X_test_wc <- X_test[ind]
Y_test_wc <- Y_test[ind]
# Quantile predictions with Random Forest
quantiles_reg_forests <- predict(Model_final,
data = X_test_wc,
type = "quantiles",
quantiles = seq(0, 1, by = 0.1))
pred_intervals <- as.data.table(quantiles_reg_forests$predictions)
actual_profit_w_pred_intervals <- data.table(cbind(Y_test_wc, pred_intervals))
names(actual_profit_w_pred_intervals) <- c("Y_test","Q0", "Q10","Q20", "Q30", "Q40",
"Q50","Q60", "Q70","Q80", "Q90", "Q100")
quantiles <- list("Q10", "Q20", "Q30", "Q40", "Q50",
"Q60", "Q70", "Q80", "Q90", "Q100")
bets_placed_quantile_wc <- map(quantiles,
summarise_bets_placed)
bets_placed_quantile_wc <- rbindlist(bets_placed_quantile_wc)
bets_placed_quantile_wc

In this worst case, bets are only placed from the Quantile 30%-based decision rule onwards. Thus, a risk-averse gambler won't play and therefore won't lose money.
Our strategy can therefore be used to mitigate risk and avoid significant financial losses.
A work by Benjamin DAVID